Triton v3.6.x iluvatar backend and 5 TLE primitives support by Salamanca001 · Pull Request #724 · flagos-ai/FlagTree

Salamanca001 · 2026-06-26T10:37:59Z

This PR brings Iluvatar backend support to Triton 3.6 in FlagTree and adds Iluvatar TLE lowering support.

Included commits:
- 7b4cac885 [BACKEND] update iluvatar backend support on triton3.6.
- 13abd15d8 [TLE][ILUVATAR] Add TLE support for alloc, local_ptr, copy, extract_tile and insert_tile.

Main changes:
- Add third_party/iluvatar backend integration, including compiler/driver entry points, Iluvatar GPU dialect, lowering passes, target info, utility code, build wiring, and test runner.
- Add Iluvatar TLE dialect/lowering support for alloc, local_ptr, copy, extract_tile, and insert_tile.
- Add local pointer optimization passes, barrier insertion, TLE-to-LLVM lowering utilities, and registration into the Iluvatar build/dialect pipeline.
- Wire Iluvatar setup/build support into top-level CMake and Python setup helpers.

CLAassistant · 2026-06-26T10:38:06Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

sunnycase

Thanks for the work here. Before this is finalized, could you please add a concise summary of the TLE primitive implementation plan?

It would be helpful to cover the main design points, such as the abstraction/lowering flow, compiler/runtime integration points, supported operator scope, dtype/shape/backend limitations, and the validation approach. Could you also include performance data for a few representative operators, ideally with baseline vs. TLE primitive numbers, test shapes, hardware/backend configuration, and measurement methodology?

For the expected level of detail and presentation style, PR #617 could be a useful reference: #617

Salamanca001 · 2026-06-30T03:55:32Z

> Thanks for the work here. Before this is finalized, could you please add a concise summary of the TLE primitive implementation plan?
>
> It would be helpful to cover the main design points, such as the abstraction/lowering flow, compiler/runtime integration points, supported operator scope, dtype/shape/backend limitations, and the validation approach. Could you also include performance data for a few representative operators, ideally with baseline vs. TLE primitive numbers, test shapes, hardware/backend configuration, and measurement methodology?
>
> For the expected level of detail and presentation style, PR #617 could be a useful reference: #617

[TLE][ILUVATAR] Support TLE Structure on iluvatar backend

This patch adds TLE (Triton Language Extension) structure support to the iluvatar
backend, covering five primitives:

tle.gpu.alloc
tle.gpu.local_ptr
tle.gpu.copy
tle.extract_tile
tle.insert_tile

1. Overview

The iluvatar backend reuses the shared TLE Python frontend under
python/triton/experimental/tle/ and plugs in backend-specific MLIR builder
bindings and lowering through a dedicated MLIR dialect iluvatar_tle.

- Python API: shared (no third_party/iluvatar/python/.../tle override).
- Builder overrides: third_party/iluvatar/tle/triton_iluvatar_tle.cc.
- MLIR dialect: iluvatar_tle (third_party/iluvatar/tle/include/IR/IluvatarTleOps.td).
- Optimization passes: 3 local-pointer passes registered in the TTGIR pipeline.
- LLVM lowering: third_party/iluvatar/tle/lib/Conversion/TleToLLVM/.

2. Supported primitives

2.1 `tle.gpu.alloc`

MLIR: lowers directly to ttg.local_alloc (mutable memdesc in shared memory),
via create_local_alloc in triton_iluvatar_tle.cc.
Coverage: shared-memory buffers with swizzled shared layout
(make_swizzled_shared_encoding_attr); optional initialized allocation
(local_alloc(value)).
Limitations (enforced in triton_iluvatar_tle.cc):
- tmem / tensor_memory storage is rejected:
  "iluvatar TLE alloc does not support tmem storage".
- nv_mma_shared_layout=True is rejected:
  "iluvatar TLE alloc does not support nv_mma_shared_layout=True".
- Any non-smem storage string is rejected:
  "iluvatar TLE alloc only supports smem storage".
- In practice tests use nv_mma_shared_layout=False, which routes through the
  swizzled shared layout default builder.
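
The rejection rules listed above can be modeled in a few lines. The sketch below is a pure-Python stand-in, not the C++ code in triton_iluvatar_tle.cc: the function name `check_alloc_request`, its parameters, the accepted storage spellings, and the order in which the checks fire are all assumptions. Only the three error strings are taken from the list above.

```python
def check_alloc_request(storage: str, nv_mma_shared_layout: bool = False) -> str:
    """Hypothetical model of the alloc checks; returns the layout kind used."""
    # Assumed spellings for the tensor-memory storage kinds.
    if storage in ("tmem", "tensor_memory"):
        raise ValueError("iluvatar TLE alloc does not support tmem storage")
    # Any other non-smem storage string is rejected.
    if storage != "smem":
        raise ValueError("iluvatar TLE alloc only supports smem storage")
    # NVIDIA MMA shared layouts are rejected on this backend.
    if nv_mma_shared_layout:
        raise ValueError(
            "iluvatar TLE alloc does not support nv_mma_shared_layout=True")
    # The remaining case routes through the swizzled shared layout.
    return "swizzled_shared"
```

Under these assumptions, `check_alloc_request("smem")` is the only combination that succeeds, which matches the configuration the tests are described as using.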

2.2 `tle.gpu.local_ptr`

MLIR: lowers to iluvatar_tle.local_pointers (create_local_pointers).
Coverage:
- Full-view pointer when indices=None (result shape equals buffer shape).
- Indexed pointers with scalar or block (tensor) indices.
- 1D and 2D buffers; load / store; masked tails; loops; dot operands; runtime round-trips.
Limitations (Python checks in core.py): integer-typed indices only; index
rank must match buffer rank; scalar and tensor indices cannot be mixed; only
shared-memory buffers are supported.
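
As a rough illustration of the index rules above, the following pure-Python sketch validates a set of indices against a buffer shape. It is not the code in core.py: the `Index` class, the function name, the error messages, the reading of "index rank" as the number of index operands, and the rule that block indices share one shape which becomes the result shape are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Index:
    """Hypothetical stand-in for one index operand."""
    dtype: str                      # e.g. "int32" or "float32"
    shape: Tuple[int, ...] = ()     # () marks a scalar, otherwise a block


def check_local_ptr_indices(buffer_shape: Sequence[int],
                            indices: Optional[Sequence[Index]],
                            storage: str = "smem") -> Tuple[int, ...]:
    """Validate indices and return the assumed shape of the pointer result."""
    if storage != "smem":
        raise ValueError("only shared-memory buffers are supported")
    if indices is None:
        # Full view: the result shape equals the buffer shape.
        return tuple(buffer_shape)
    if len(indices) != len(buffer_shape):
        raise ValueError("index rank must match buffer rank")
    if any(not idx.dtype.startswith(("int", "uint")) for idx in indices):
        raise TypeError("indices must be integer-typed")
    # Either every index is a scalar or every index is a block.
    if len({idx.shape == () for idx in indices}) > 1:
        raise ValueError("scalar and tensor indices cannot be mixed")
    # Assumption: block indices share one shape, which is the result shape.
    return indices[0].shape
```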

2.3 `tle.gpu.copy`

Normal copy (tensor <-> buffered_tensor): lowered via the load/store +
local-pointer path (normcopy):
- GM (global memory) -> local: tl.load(src) -> local_ptr(dst) -> tl.store.
- local -> GM: local_ptr(src) -> tl.load -> tl.store(dst).
- There is no dedicated TLE MLIR copy op on iluvatar.
Descriptor / TMA copy: not supported on iluvatar.
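
The two normal-copy directions can be emulated on the host to show the data flow. This is a minimal sketch in plain Python with flat lists standing in for global memory and the shared-memory buffer; `load`, `store`, and both copy helpers are hypothetical names, and a full-view local pointer is modeled simply as the offsets 0..len-1.

```python
from typing import List, Sequence


def load(memory: Sequence[float], offsets: Sequence[int]) -> List[float]:
    """Gather values at the given offsets, standing in for tl.load."""
    return [memory[o] for o in offsets]


def store(memory: List[float], offsets: Sequence[int],
          values: Sequence[float]) -> None:
    """Scatter values to the given offsets, standing in for tl.store."""
    for o, v in zip(offsets, values):
        memory[o] = v


def copy_global_to_local(gm: Sequence[float], gm_offsets: Sequence[int],
                         local: List[float]) -> None:
    # GM -> local: load from global memory, then store through local pointers.
    local_ptrs = list(range(len(local)))  # models a full-view local pointer
    store(local, local_ptrs, load(gm, gm_offsets))


def copy_local_to_global(local: Sequence[float], gm: List[float],
                         gm_offsets: Sequence[int]) -> None:
    # local -> GM: load through local pointers, then store to global memory.
    local_ptrs = list(range(len(local)))
    store(gm, gm_offsets, load(local, local_ptrs))
```

Both directions are compositions of one load and one store, which is consistent with the statement that no dedicated copy op is needed on this path.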

2.4 `tle.extract_tile`

Coverage: static multi-dim index, static scalar index, dynamic scalar index
(tl.tensor), and dynamic multi-dim index (auto-linearized).
LLVM lowering (tle/lib/Conversion/TleToLLVM/ExtractTileToLLVM.cpp):
- Tensors must be ranked, must carry an encoding, and the encoding must be
  BlockedEncodingAttr (other encodings rejected with explicit errors).
- Static + CTA-tile-aligned indices -> register-shuffle path.
- Dynamic or misaligned indices -> shared-memory relay path.
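
The path selection described above might be sketched as follows. This is a host-side model, not the logic in ExtractTileToLLVM.cpp. It assumes that a multi-dim tile index is linearized in row-major order, that indices count tiles rather than elements, and that "CTA-tile-aligned" means the tile's element offset is a multiple of the CTA tile extent in every dimension. The function names and the returned strings are hypothetical.

```python
from typing import Optional, Sequence


def linearize(index: Sequence[int], grid: Sequence[int]) -> int:
    """Row-major flattening of a multi-dim tile index (assumed order)."""
    flat = 0
    for i, n in zip(index, grid):
        if not 0 <= i < n:
            raise IndexError("tile index out of range")
        flat = flat * n + i
    return flat


def choose_path(index: Optional[Sequence[int]],
                tile_shape: Sequence[int],
                cta_tile_shape: Sequence[int]) -> str:
    """Pick a lowering path; index=None marks a dynamic (runtime) index."""
    if index is None:
        return "shared_memory_relay"
    # Element offset of the tile in each dimension.
    offsets = [i * t for i, t in zip(index, tile_shape)]
    # Assumed alignment rule: every offset is a multiple of the CTA tile extent.
    aligned = all(o % c == 0 for o, c in zip(offsets, cta_tile_shape))
    return "register_shuffle" if aligned else "shared_memory_relay"
```

For example, with 16x16 tiles and an assumed 32x32 CTA tile, static index (2, 0) starts at element offset (32, 0) and would take the register path, while static index (1, 0) starts at (16, 0) and would fall back to the shared-memory relay.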

2.5 `tle.insert_tile`

MLIR: iluvatar_tle.insert_tile (create_insert_tile); the tile shape is
inferred from the tile operand (no separate tile_shape attribute), and the
op declares InferTypeOpInterface.
Coverage: static multi-dim index, static scalar index, dynamic scalar index.
LLVM lowering (InsertTileToLLVM.cpp): same BlockedEncodingAttr-only
constraint; static CTA-aligned register-merge path vs. dynamic shared-memory path.
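
To make the shape inference concrete, here is a small reference model of insert_tile on 2D nested lists. It is only a sketch of the intended semantics as read from the description above: the tile shape comes from the tile operand, and the result has the same shape as the destination. Treating the index as a tile coordinate and requiring the tile shape to divide the destination shape are assumptions.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def shape_of(m: Matrix) -> Tuple[int, int]:
    """Return (rows, cols) of a non-empty rectangular nested list."""
    return len(m), len(m[0])


def insert_tile(dst: Matrix, tile: Matrix, index: Sequence[int]) -> Matrix:
    """Return a copy of dst with `tile` written at tile coordinate `index`."""
    rows, cols = shape_of(dst)
    th, tw = shape_of(tile)           # tile shape inferred from the operand
    if rows % th or cols % tw:
        raise ValueError("tile shape must evenly divide the destination shape")
    r0, c0 = index[0] * th, index[1] * tw
    if not (0 <= r0 <= rows - th and 0 <= c0 <= cols - tw):
        raise IndexError("tile index out of range")
    out = [row[:] for row in dst]     # result shape equals the dst shape
    for r in range(th):
        out[r0 + r][c0:c0 + tw] = tile[r]
    return out
```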

3. Lowering path

Key differences from the native Triton path:

Python TLE API
  alloc       -> ttg.local_alloc                                  (direct)
  local_ptr   -> iluvatar_tle.local_pointers
  copy (norm) -> tt.load + iluvatar_tle.local_pointers + tt.store
  extract_tile-> iluvatar_tle.extract_tile
  insert_tile -> iluvatar_tle.insert_tile

TTGIR passes (iluvatar-specific):
  triton-iluvatar-tle-insert-local-pointer-barriers
  triton-iluvatar-tle-optimize-local-pointer-loads
  triton-iluvatar-tle-optimize-local-pointer-stores

make_llir:
  populateTleToLLVMPatterns  (extract_tile / insert_tile / local_pointers)
  -> register shuffle / merge, or shared-memory relay -> LLVM
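
The mapping in the diagram above can also be written down as data, which makes it easy to ask which primitives depend on the TLE-to-LLVM patterns. The table below is transcribed from the diagram; the helper `needs_tle_to_llvm` and the conclusion it draws are an interpretation for illustration, not something the patch defines.

```python
# Primitive -> ops it is described as lowering to (from the diagram above).
LOWERING = {
    "alloc": ["ttg.local_alloc"],
    "local_ptr": ["iluvatar_tle.local_pointers"],
    "copy": ["tt.load", "iluvatar_tle.local_pointers", "tt.store"],
    "extract_tile": ["iluvatar_tle.extract_tile"],
    "insert_tile": ["iluvatar_tle.insert_tile"],
}

# Ops that populateTleToLLVMPatterns is described as covering.
TLE_TO_LLVM_OPS = {
    "iluvatar_tle.extract_tile",
    "iluvatar_tle.insert_tile",
    "iluvatar_tle.local_pointers",
}


def needs_tle_to_llvm(primitive: str) -> bool:
    """True if any op produced for the primitive is an iluvatar_tle op."""
    return any(op in TLE_TO_LLVM_OPS for op in LOWERING[primitive])
```

Read this way, alloc is the only primitive that never produces an iluvatar_tle op, while copy reaches the TLE patterns indirectly through local_pointers.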

4. Tests (run in iluvatar CI when `FLAGTREE_ILUVATAR_TLE` is enabled)

From third_party/iluvatar/test_triton.sh:

Support TLE Tests
`python/test/tle/integration/test_tle_local_store.py`
`python/test/tle/unit/test_tle_gpu_local_ptr.py`
`python/test/tle/unit/test_extract_tile_static_index.py`
`python/test/tle/unit/test_extract_tile_dynamic_index.py`
`python/test/tle/unit/test_insert_tile_static_index.py`
`python/test/tle/unit/test_insert_tile_dynamic_index.py`
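
For running the same files locally, a small helper along these lines could build the pytest invocation. It is not the contents of test_triton.sh: the helper name is hypothetical, and the rule for when `FLAGTREE_ILUVATAR_TLE` counts as enabled (any value other than empty, "0", "false", or "off") is an assumption, since the accepted values are not stated here.

```python
import sys
from typing import List, Mapping, Optional

# Test files listed above, relative to the repository root.
TLE_TESTS = [
    "python/test/tle/integration/test_tle_local_store.py",
    "python/test/tle/unit/test_tle_gpu_local_ptr.py",
    "python/test/tle/unit/test_extract_tile_static_index.py",
    "python/test/tle/unit/test_extract_tile_dynamic_index.py",
    "python/test/tle/unit/test_insert_tile_static_index.py",
    "python/test/tle/unit/test_insert_tile_dynamic_index.py",
]


def tle_pytest_command(env: Mapping[str, str]) -> Optional[List[str]]:
    """Return the pytest command, or None when the TLE tests are disabled."""
    flag = env.get("FLAGTREE_ILUVATAR_TLE", "")
    # Assumed convention for "disabled" values.
    if flag.strip().lower() in ("", "0", "false", "off"):
        return None
    return [sys.executable, "-m", "pytest", "-q", *TLE_TESTS]
```

A caller would pass `os.environ` and hand a non-None result to `subprocess.call` from the repository root.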

5. Performance data

5.1 Measurement methodology

Benchmark sources (backend-agnostic tutorials):
- python/tutorials/tle/01-fft.py

5.2 Environment

| Field | Value |
| --- | --- |
| Hardware | Iluvatar Corex |
| Driver / SDK | `4.5.0` |
| Torch | `2.10.0` |
| FlagTree | `triton_v3.6.x` |

5.3 Representative results

|   | N | Triton (ms) | TLE (ms) | Torch (ms) |
| --- | --- | --- | --- | --- |
| 0 | `64.0` | `0.045962` | `0.116885` | `0.022308` |
| 1 | `128.0` | `0.064135` | `0.129558` | `0.036635` |
| 2 | `256.0` | `0.135442` | `0.187827` | `0.056135` |
| 3 | `512.0` | `0.427827` | `1.050106` | `0.122798` |
| 4 | `1024.0` | `1.268423` | `3.248211` | `0.221808` |

Speedup is computed as baseline_time / TLE_time:

| Comparison | Mean |
| --- | --- |
| TLE FFT vs Triton FFT | 0.48x |
| TLE FFT vs Torch FFT | 0.19x |
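
The mean speedups can be reproduced from the results table in 5.3. The snippet below assumes that "Mean" is the arithmetic mean of the per-size ratios baseline_time / TLE_time; under that reading the table values round to the reported figures. The variable and function names are illustrative only.

```python
from statistics import mean

# (N, Triton ms, TLE ms, Torch ms), copied from the results table in 5.3.
RESULTS = [
    (64, 0.045962, 0.116885, 0.022308),
    (128, 0.064135, 0.129558, 0.036635),
    (256, 0.135442, 0.187827, 0.056135),
    (512, 0.427827, 1.050106, 0.122798),
    (1024, 1.268423, 3.248211, 0.221808),
]


def mean_speedup(baseline_col: int, tle_col: int = 2) -> float:
    """Arithmetic mean of baseline_time / TLE_time over all sizes."""
    return mean(row[baseline_col] / row[tle_col] for row in RESULTS)


print(f"TLE FFT vs Triton FFT: {mean_speedup(1):.2f}x")  # 0.48x
print(f"TLE FFT vs Torch FFT:  {mean_speedup(3):.2f}x")  # 0.19x
```

A value below 1.0x means the TLE kernel is slower than the baseline it is compared against.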

6. Status note

This patch delivers functional support for the five TLE primitives on the
iluvatar backend (correctness validated by unit/integration tests and CI). As the
benchmark results above show, TLE paths are not yet competitive with native Triton
or Torch kernels. Performance optimization is planned for follow-up commits.

Salamanca001 requested review from Galaxy1458, sunnycase and zhzhcookie as code owners June 26, 2026 10:38

github-actions Bot added iluvatar triton_v3.6.x labels Jun 26, 2026

zhzhcookie reviewed Jun 26, 2026

Comment thread python/setup_tools/setup_helper.py Outdated

zhzhcookie reviewed Jun 26, 2026

Comment thread CMakeLists.txt Outdated

sunnycase reviewed Jun 29, 2026

Salamanca001 changed the title ~~Triton v3.6.x iluvatar~~ Triton v3.6.x iluvatar backend and 5 TLE primitives support Jun 29, 2026

huatuoli added 2 commits June 30, 2026 02:40

[BACKEND] update iluvatar backend support on triton3.6.

c51c847

[TLE][ILUVATAR] Add TLE support for alloc, local_ptr, copy, extract_tile and insert_tile.

9a2d8c9

Salamanca001 force-pushed the triton_v3.6.x_iluvatar branch from 13abd15 to c39b53d Compare June 30, 2026 03:51

Salamanca001 force-pushed the triton_v3.6.x_iluvatar branch 2 times, most recently from 7ed5e70 to cc55e41 Compare June 30, 2026 08:04

[CI/CD] Add iluvatar3.6 ci/cd runner.

c257aa5

Salamanca001 force-pushed the triton_v3.6.x_iluvatar branch from cc55e41 to c257aa5 Compare June 30, 2026 09:38

Salamanca001 wants to merge 3 commits into flagos-ai:triton_v3.6.x from Salamanca001:triton_v3.6.x_iluvatar


5 participants
